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Abstract 

In this study, new highly parallel algorithm of two-sided 
Jacobi 8-D transformation is suggested. It is oriented on 
VLSI-implementation of special processor array. This array 
is built using 8-D CORDIC algorithm for quaternion valued 
matrix singular value decomposition. Accuracy analysis and 
simulation results are added. Such array can be utilized to 
speed up the Jacobi method realization to compute the SVD 
of a quaternion matrix in signal and image processing. 
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Introduction 

Singular value decompositions (SVD) and eigenvalue 
decomposition (EVD) are important problems in linear 
algebra and digital signal processing. The most widely 
employed method of solving SVD problem is zeroing 
the entries in a matrix by a sequence of rotations or 
reflections (see Rader [1996]). In many such 
computations, it is necessary to vanish some elements 
selectively with matrix similarity transformation. High 
practical importance of these problems leads to the 
need of designing hardware-oriented algorithms for 
their fast VLSI implementation. The Jacobi method is 
the frequently used tool and very high throughputs 
are required from the special processors which will be 
used as array's elements for the parallel computation 
of these transformations. 

The coordinate rotation computer (CORDIC) is an 
effective processing element for such systolic arrays 
which executes these transformations without 
multiplications (see Meher [2009]). The original 
CORDIC algorithm processes two real numbers at a 
time. It is well below the needed throughput for DSP. 
To speed up the matrix computations, they use higher 
dimensional rotations. To generalize the original 
CORDIC algorithms, Hsiao and Delosme offered the 



Quaternion or Pseudo-quaternion CORDIC algorithms 
for 4-D rotations(see Hsiao [1996]). These algorithms 
are used for parallel decomposition of complex 
matrices and demonstrate significant speed up in 
contrast to the original 2-D CORDIC. Following this 
direction, Octonion CORDIC algorithm (OCA) for 8-D 
rotations was suggested in Doukhnitch [2002] and 
generalized in Doukhnitch [2011] to implement Givens 
rotation applied to quaternion valued vectors. 

The study of quaternion matrices has gained interest 
in many practical areas in recent years (see Bihan, 
Miron, Zhang). It is motivated with smaller 
complexity in terms of storage and computations 
needed for a direct quaternion algorithm (seeBihan 
[2007]). Besides that, use of quaternion operands for 
computations leads to less computational errors. The 
application of quaternions to represent color images 
has been introduced by Sangwine [1996]. A color 
image is represented as a pure quaternion image: 

S(x,y) =r(x,y)i + g(x,y)j + b(x,y)k, (1) 

wherer(x,y), g(x,y), b(x,y) are respectively the red, 
green and blue components of a pixel at position (x,y) 
in the image S(x,y). This representation has allowed 
the definition of powerful tools for color image 
processing such as Fourier transforms, compression, 
correlation, recognition or edge detection including 
using SVD (see Bihan[2003], Pei [2003]). In the last few 
years, quaternions were used in statistical signal 
processing for multiple sources characterization and 
discrimination. 

In accordance with Jacobi method, they divide the 
rows of a given matrix into pairs and for each pair 
they calculate an angle for two-sided rotation to null 
off-diagonal element. In 1983, the Brent-Luk-Van Loan 
(BLV) systolic array was proposed for fast SVD 
parallel computation, see Brent [1983]. In Cavallaro 
[1988], there was suggested to use CORDIC processors 
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for such array. B.Yang and J.F. Bohme proposed the 
Two Plane Rotation (TPR) method to perform two- 
sided 2x2 SVD as two parallel rotations, see Yang 
[1991]. Hsiao and Delosme proposed to use 
multidimensional CORDIC algorithms in systolic 
array for both real and complex matrices (see Hsiao 
[1996]). There are many modifications of Jacobi 
method realization in publications. For example, the 
approximate CORDIC-based Jacobi algorithm was 
considered in Gotze [1993]. In Strumpen [2003], a 
stream SVD algorithm was suggested. In Snopce 
[2010], a preliminary complex-to-real transformation 
was proposed. 

H. C. Lee and then F. Zhang [1997] suggested 
calculating SVD of quaternion matrices by means of 
the SVD of its equivalent complex matrix with twice 
size (see also Bihan [2003], Pei [2003]). The two-sided 
Jacobi algorithm has also been used to solve SVD of a 
quaternion matrix directly without such decomposition 
(see Bihan [2007]). 

In this paper, we suggest to use Octonion CORDIC 
algorithm for hardware realization of Jacobi 
transformation with BVL processor array. The only 
difference is in new processor elements (PEs). They 
can speed up the process because of parallel 
calculation of elements of 8-D vectors (pairs of 
quaternions) and alignmentat runtime of angle 
calculation and rotations implementation. But it 
needssignificant modification of diagonal PEs. It is due 
to the fact that OCA does not have an evident 
representation of angle for 8-D rotation and TPR 
method cannot be realized directly. 

In Section 1, a short description of Givens 
transformation with OCA for quaternion operands is 
given. In Section 2 OCA modification for Jacobi and 
TPR methods is described. The algorithm of SVD 
implementation with processor array is suggested in 
Section 3. In Section 4, the error analysis together with 
simulation results are shown. Finally, in the last 
section, the conclusions are given. 

Quaternion Vector Transformation with OCA 

The typical matrix operation for a matrix 
decomposition is an one-sided linear transformation of 
rotation: 

Y=P*X (2) 
If we have a quaternion valued vector X=(qi, <p) T in (2) 
we can describe an 8-D rotation of equivalent 8-D real 
vector X=(qu, qn, qn, qu, qn, qn, qn, qn) 1 , where 
quaternion / has components q\=( qn, qn, qa, qu) (1=1,2). 



The Cayley numbers(see Ward [1997])or octonions (8-D 
objects)can be used to represent 8x8 matrix P as a 
product of octonions of elementary rotations: 

p=r\ n M t o) 

These octonions have a unit norm (Nm=1), and can be 
represented in polar form: 

M t = cos (pi + sin (pi (ai + /3j + yk + SI + Aq 

+ pir + ps) ( ' 

where (p t = atan2~ l . Following Doukhnitch [2002], the 
8-D rotation matrix can be represented as Rs,i=(l/coscf>i) 
Mior 

R 8,i = 

■ 1 (Xjtj Pjti Yjtj 8iti Xjti l^i Pjtj ■ 

-txjti 1 —5^! -p.tj Pjtj — n,ti XtU Yjti 

—Pits 1 -Xitj -«iti Yjti -p.ti M,ti 

-Yjti Pjti ^tj 1 -p-jtj -Pjtj 8jtj -tXjti 

-Sjtj -Pjtj cqti Hjtj 1 - Pj ti -Yjtj Xjtj > 

-Xjtj |xtj -Yjtj Pjtj Pjtj 1 -oijtj -Sjtj 

-Hjtj -Xjtj Pjtj -5jtj ^tj cqti 1 -Pjtj 

-Pjti -Yjtj -Hjtj (Xjtj -^tj 8jtj Pjtj 1 . 

The scaling factor is fc=l/cos</>;= aA + ^ - Many 
strategies were proposed to force k — fl; fc; to be a 
simple signed bits representation for efficient scaling 
correction, e.g. Yang [1991] or Cavallaro [1988]. All of 
them use additional sequence of shift-add operations 
with average total number n/4. 

The rotation parameters f; are equal to 2 ^ y where {/(;')} 
is a non-decreasing positive integer sequence: 

f(i) ={0,1* 7 2* 7 3* 7 4* 7 5,6 7 7* 7 8 13*,14 n} (6) 

where the sign * denotes a repetition. Thus, for n-bits 
accuracy (n>13), the total number of iterations is (n+6) 
with guaranteed convergence (see Doukhnitch [2011]). 

The control signs at...p are either 1 orl. Without 
performing the scaling by kr 1 , the implementation of 
one elementary rotation consists of eight concurrent 
shift-and-add operations. The Octonion CORDIC 
algorithm: 

Ym = Rs,i Yi, ( i = 0, n ; Yo=X), ( 7 ) 
as well as original CORDIC algorithm can be used in 
two modes-rotation (application of transformation) and 
vectoring (evaluation of parameters of transformation). 
The usual task of vectoring in such algorithms is the 
annihilation of required components in a given vector, 
as after Voider' s plane rotation, one of the two 
components becomes zero, while in 8-D space, seven 
components may be zero. A sequence of elementary 
rotations with matrices (5) is applied to an 8-D vector 
X= [xi, xi, X3, X4, xs, xa, X7, xs] T to bring it along the first 
canonical axis. To achieve this, the control signs are 
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selected for (5) according to the following expressions: 



(8) 



on =/i-sign(z2,i); pi= /i-sign(z3,i); 

yp= /i-sign(z4,;); &= fi- sign(zs,>); 
Xi =/;-sign(z6,i); /m =/;-sign(z7,>); 

pi =/i-sign(zs,i); /; = sign(z:u); 
where zy,; denotes the ;th component of a vector Z; at 
the beginning of the (;+l)th iteration. For vectoring 
mode Zi=Yi (Yo=X), and for rotation mode, the control 
signs can be obtained from a preliminary vectoring 
operation for some vector Z on-the-fly. The result of (7) 
is 



(9) 

The hardware implementation of the OCA taken from 
Doukhnitch [2011]is shown in Fig. 1. 

The processor is composed of shifters, which performs 
the multiplication using U, 7-to-2 Carry Save Adder 
(CSA) arrays, and 3-input full adders. The Si,S2, ...,Ss at 
the top of the figure are the components where the 
required control signs for the next iteration are 
prepared, the details of which are given in (8). Owing 
to their ineffectiveness in the sense of execution time 



with respect to our design and to sustain the 
traceability of the figure, these components are shown 
separately. The control signs en... pi are used 
subsequently to affect the signs of the elements of 8-D 
vector Yi. The I/O registers are designed such that the 
data transfer into and out of the processor can take 
place simultaneously. In Fig. 1, components labelled as 
-1 are used to change a sign of operand. 

Quaternion J acobi Transformation 

In this paper, as a particular case, we cosider only 
Hermitian matrixes because they have own important 
applications though the offered approach can be 
extended to a case of rectangular matrixes. The Jacobi 
procedure problem is as follows. A 2x2 block of a 
Hermitian quaternion m* mmatrix A is given: 

A rc = \ arr arc \ (10) 
\_(t cr a cc j 

where quaternion (bold face symbol) a cr — a rc and 
diagonal elements are real. The over bar denotes 
(complex or quaternion) conjugation. In Zhang [1997], 
it was shown that a quaternion Jacobi rotation is 
analogous to a complex Jacobi rotation. 



z 3,irS a Z3,i-S 3 Z 4 ,i-S4 Z 5 j-S5 



P 




FIG. 1. OCTONION CORDIC-PROCESSOR ARCHITECTURE 
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Using a 2 x 2 quaternion unitarymatrix P, we can 
perform a Jacobi rotation on An, i.e. 

P*A rc P = R, (11) 
where* stands for conjugation-transposition and R is a 
2x2 quaternion diagonal matrix with real components 
only: 



R = 











A general 2x2 unitary quaternion matrix P has 
elements which have moduli that are cosines and sines 
of some angle 6: 

with|c| = cosi!©, |s| = sin 9 and the constraint 



(12) 



cc + 55=1. The elements (real c and quaternion s) can 
be calculated as following Bihan [2007]: 

1 sin0 
c = cos 9 = , ; 5 = a rc (13) 



where 



VTT7 



t = 



signr 



|t| +VTT^' 



t — cot 2 9 — ■ 



-; sin0 = tc 



21a,... 

Two-sided rotation (11) can be executed with two 8-D 
rotations in sequence but the angle 6 of rotation must 
be calculated before. This angle can be represented in 
no evident form (in an implicit fashion)as a sequence of 

sets of control signs {ca...pi}e ( ? = 0> n )(see Delosme 
[1990]). In general, for a Hermitian matrix, to calculate 
an angle = atany/x we can apply algorithm (7) in 
vectoring mode with quaternion vector X= [x, y] T and 
the corresponding sequence will be generated. If we 
take X—Qjj-Qcc and y= aa+a rc we can calculate a sequence 

of sets { as. . .pi}ze ( * = 0, n ) corresponding to angle 26. 

To determine the angle 6, this sequence can be 
transformed using a formula: 



a sin a 

tan 77 = z 

2 1 + cos a 



(14) 



First, this sequence is applied to rotate an unit vector 
X= [1, 0, 0, 0, 0, 0, 0, 0] T to get two quaternion 
components of matrix (10) in the result Y= [c, s] T and 
then vectoring mode is utilized for modified vector 

X= [c+1, s] T (15) 
As a result, we will get a sequence of sets {ca...pi}e 

( i = 0, n ) corresponding to angle 6. This sequence can 
be used for orthogonal transformation of matrix A. 

To speed up the considered calculations, we can use 
Two Plane Rotation (TPR) method to perform 2x2 SVD 



A A I A [Pi ~ 0l l , \~? 2 ^21 

A rc = A 1+ A 2 = [ qi pi \ + [ q2 J, 



(16) 



as two rotations in parallel suggested in Yang [1991]. 
This method represents 2x2 matrix (10) as a sum: 

rPi -°H , \~P2 921 
92 Pi\ 

where pi=(a rr +acc)/2, qi=(a C r- a^~ c )/2, p2=(acc-a,r)/2, 
q2=(aci+a rc )/2. 

In general case, the transformation (11) can be 
executed as two rotations in vectoring mode for 
quaternion vectors Xi=(pi, qi) J and Xi=(p2, qi) 1 . As a 
result, the matrix R is calculated as 

L ° yu+y2iJ 

where yu and yn are the first components of the 
resulting vectors Yi and Yi correspondently. In 

addition, two sequences of sets [ca...fx\ (i = 0,n ) 
corresponding to angles 0+ and 6- for TPR are 
produced. Then we can use two angles for two-sided 
rotation (11)0! = (0 + - 6L)/2 and 9 2 = (0 + + 6»_)/2 to 
modify off-diagonal elements of A(two rows and two 
columns) . 

Actually, for Hermitian matrices, we have qi=0 and the 
first rotation is unnecessary (6-=0, yn= pi, and 9 X — 
9 2 = 9 + /2). The elements pi and pi are real and qi 
=alf is a quaternion. Therefore, we can use sets {at. . .pi) 

( i = 0, n ) corresponding to angle d+ to rotate an unit 
vector for generating vector (15). Then the sequence of 

sets {ca...pi}e ( z = 0, n ) can be obtained corresponding 
to angle 61=62 ior rotation matrix P which is used for 
orthogonal transformation of m*m matrix A. 

SVD Processor Array 

The Jacobi algorithm (Rutishauser[1971])consists of 
iteratively applying a basic 2x2 diagonalization 
formula (11) until the entire wx m matrix is diagonal. It 
works in several 'sweeps' until convergence is 
achieved. In each sweep, it rotates away all the non- 
zero off-diagonalelements. But every such rotation of 
course creates other non-zerooff-diagonal elements. It 
can be shown, however, that the sum of the absolute 
values of the off-diagonalelements is reduced in each 
sweep. More precisely, the Jacobi method has 
quadratic convergence. Convergence is achieved in 
O(log m)sweeps(Rutishauser [1971]). 

The algorithm for hardware realization of Jacobi 
transformation can be described as follows: 

Algorithm ]A: 

s=0; A< S >=A; 

.... .. SUM^ , 

While > £ { 

SUM),' 
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For step =1 to m-1 1 

Generation of ml 2 pairs (r, c); 

For each pair in parallel calculate rotation 

angles (sequence ofsetsi at. . .ptje); 

rotate:A( s+ V = P&*A&P W~ usingTPR 

} -end for 
s=s+l 
} -end while 

where SUMi-sum of off-diagonal moduli of A (s> in 
sweep s, SUM2-sum of on-diagonal moduli of A (s> in 
sweep s, matrix P (s) is a plan rotation in the (r, c) plane 
defined by the parameters(c,s, -s,c) in the (rr,rc,cr,cc) 
entries of an m*-m identity matrix. For serial 
implementation, the simple strategy for generation of 
index pairs (r, c) is to chose the "cyclic-by-rows" 
ordering 

(1,2), (1,3), ...,(1, m), (2,3),...,(m-2, m ). 
For parallel implementation, we can use "Brent-Luk" 
ordering (see Brent [1983], [1985]). A sweep for (m-1) 
steps has been executed with mil parallel annihilations 
per each. 




FIG. 2. TRIANGLE PROCESSOR ARRAY FOR JACOBI SVD 

Parallel implementation of m*m matrix singular value 
decomposition is based on square BLV array with mil 
processor elements (PE) per side, where each 
processor contains one 2x2 submatrix, e.g. see Brent 
[1985]. Taking into account the equality a rc — a cr for 
Hermitian matrix, a triangle processor array can be 
used with one diagonal PE and (m/2-1) off-diagonal 
PEs per horizontal side, and one diagonal PE and (m/2- 

1) off-diagonal PEs per vertical side (see Brent [1985]). 
The total number of PEs is m/2diagonal PEs and (m/2- 

2) xm/4off-diagonal PEs. The configuration of this array 
is shown in Fig. 2, where crosses are matrix quaternion 
elements. All the PEs have nearest- neibour 
connections for permutations (17), and details are 
given in Brent [1985]). A phase of internal 



rearrangement of the elements of Arc must be made 
after each step,see also Ma [1999]. Each diagonal PE is 
connected with all off-diagonal PEs in the row and in 
the column to send them the sequence of sets {ca...pi}e 
for rotations. 

All array PEs are built using module for OCA 
implementation shown in Fig. 1. Each PE performs a 
Jacobi transformation for 2x2 matrix with quaternion 
elements. In Fig. 3 and Fig. 4, Octonion CORDIC 
application and evaluation symbols are shown 
correspondently. 




FIG. 3.0CTONION CORDIC APPLICATION 



X=[x 1/ x 2 ,...,x s ]^ 



OCA 
eval 



n Y=[y,,0,0,0,0,0,0,0] T 

/ |» I»i Pi>a 



FIG. 4. OCTONION CORDIC EVALUATION 




mi OCA! 

eval 



X=[1,0,0,...,0] T 



Y 2 =[y 2 iAO,...,o] 

[a....p.} w 



» OCA2 

apply 

7 r 1 



1^ 



OCA3 

eval 



¥- 



FIG. 5. DIAGONAL PE OF THE SVD ARRAY 

Diagonal PE consists of 2 OCA modules described in 
Fig.l (OCAi and OCAs) to execute CORDIC 
evaluations (see Fig.5) and one module OCA2 to rotate 
vector X= [1, 0, 0, 0, 0, 0, 0, 0] T to get two quaternion 
components of matrix (12) in the result Y= [c, s] T . This 
result is used to calculate vector (15) for module OCA3 
to generate evaluated control signs and to send them 
to the off-diagonal PEs in the same row and the same 
columnon-the-fly. As soon as the control signs are 
available, the off-diagonal processors can perform the 
CORDIC applications for 2x1 quaternion vector as 
shown in Fig.3, where 

Y=PX,forX= [a rl , a cl ] T , l=c+l,c+2, ...,m, 

and for X= [a er , a ec ]\ e=l,2,...,r-l (17) 

However, to organize a parallel processing for all PEs 
we should use the TPR for off-diagonal PEs as well. 
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Each off-diagonal PE represents a submatrix 

as a sum (16) and then executes TPR: 

[r^tif = P(e_)[p 1 ,q 1 ] r > 

[r 1)tl ] r = P(6 + ) [p^f, (19) 
where pi=(au+a2i)l2, qi=(azi- )/2, pi=(a22-au)l2, 
q2=(a2i+a^)/2,Q + =Q 1 + Q 2 , 0_=0 2 - e t . Angle © 2 
comes from column diagonal PE and the angle Q^s 
comes from row diagonal PE. These angles should be 
represented as two parallel sequences of sets of control 

signs {at...pi}e (* = 0' n ), and in each iteration I the 
OCA module should execute one rotation with set 
{a t ...pjof 2 and one rotation with set [a t —p^ of 
Q 1 changing or not all signs to opposite values for 
addition or for subtraction: 



{a, ... Pi }B 



+/- 



0,n 



) 



(20) 



{a i2 ,a±a n )-P i 2'(±P a )} C 
Therefore, all off-diagonal PEs spend twice time for 
rotation with scaling factor k 2 . 

Then the transformed matrix .4 '2*2 is obtained as: 

a' 11 = r 1 -r 2 ,a\ 2 = -tj + t 2 , 
a'21 = ti + t 2 , a'22 = r 1 +r z . 
The hardware implementation of off-diagonal PE is 
shown in Fig. 6. 

\Pl s 




FIG. 6.OFF-DIAGONAL PE OF THE SVD ARRAY 

Thus, the proposed algorithm provides the following 
levelsof processing: 

transformation ofall elements of the given 
matrix; 

overlapping in time of the rotation angle 
calculation androtation implementation; 

replacement of sequential two-sided rotations 
with two parallel one-sided rotations; 

simultaneous transformation of elements of 8- 
D vectors. 

It is important to stress that the diagonal of resulting 



quaternion matrix has real components only. 

The singular values and eigenvalues are the same for 
Hermitian matrices. So, the eigenvectors for the matrix 
A, which will be identified by B, can be obtained 
following the iterative process also proposed by Jacobi. 
Obtaining the matrix B is executed simultaneously 
with calculating the singular values (see Bravo [2008]). 
The matrix of eigenvectors (columns of B) associated 
to the eigenvalues of A, is also based on the 
decomposition of the matrix in submatrices of 2x2 
elements. In this case, the input matrix to be 
subdivided is the identity matrix. Successive iterations 
will be applied, until the eigenvectors associated are 
found. To implement them, another (the second) 
processor array of size m/2xm/2 can be used with no 
difference between diagonal PEs and off-diagonal PEs. 
The equation to compute for each processor is the 
following: 



B 



0+1) 



dO) p (s) 



(21) 



where P^f is a matrix (12) and corresponds to 
systolic eigenvector processor which performs the 
CORDIC applications for two quaternion vectors with 
two slight modified OCA modules as in Fig. 7. 



The eigenvector PE executes the 
transformation instead of (21) as follows: 



left-side 



,(s+l) 



where C Tr = 



b rr 
-b r 



-b r 
b,, 



rc ^rc ' 



. The result of (21) can be 



achieved with sign changing for elements bn and h 

-h 



K 



OCA 

apply 



br 1 



{a....p.}^ 



OCA 

apply 



— ru b s 



FIG. 7. EIGENVECTOR PE OF THE SVD ARRAY 

The sequence of control signs for rotations is taken 
from corresponding DPEs of the first processor array 
in the same row on-the-fly. After each iteration, a 
process of rearrangement of the elements of each B r( ! 
is made similar to the calculation of eigenvalues. After 
the same number of iterations as that in the 
eigenvalues calculation, the columns of resulting 
matrix B are the eigenvectors associated to the 
eigenvalues determined by the first array. 
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Error Analysis and Simulation Results 

Following Hu [1992] and Paul [1995], we can analyse 
two types of quantization error: an approximation 
error due to the quantized representation of rotation 
angle and a rounding error due to the finite precision 
representation in fixed point arithmetic. 

For fixed point arithmetic, all numbers have to be 
restricted to -1< x <1. Again, the number possesses h 
digits and machine accuracy is given by £ Fi = 2 h l . If 
the result of the fixed point computation lies within 
the range, z = Fi(x ±y) = x + y holds (Paul [1995]). We 
use only operations op e {+,-, shift} calculated with a 
rounding error according to z = Fi(x op y)= x op y + £ Fi . 
Therefore, if an error vector (see Hu [1992]) in each 
step due to rounding is 

e(i) = [e u ,e 2i , e 3i ,e 4i ,e 5i ,e 6i ,e 7i ,e 8i ] 7 ', 
an upper bound for its absolute rounding error is: 

]e(t)l < V8e Fi = 2V2e Fi (22) 
Rotation angle 9 is calculated as a sequence of sets 

{ai...pi\e ( ' ' = 0, n ) with approximation error 6 which is 
bounded by the smallest rotation angle a. That is: 

\b\<o=atan2~ n (23) 
The quantization error of an operation of vector 
rotation is governed by (seeHu [1992]): 

(n— 1 \ n— 1 / n— 1 \ 

H^F + Z \Y\ R8j r (0 + e(n) (24) 
i=0 / i=0 \j=i J 

Scaling correction introduces an additive error (Paul 
[1995]): 

Y = (l/k)Y + Ee fi , (25) 
where E= [1, 1, 1, 1, 1, 1, 1, 1] T . Taking into account (22), 
the worst bound of (25) can be found (seeHu [1992]) 
with 



k?~ 0=0^.^ II ^ 2V2e Fi G(n), 



(26) 



where 



= i + 



n-l 



Vl + 7 * 2~ 2 J 



i=0 1 

For m*m matrices A and P, the 2-norm of the absolute 
error of one sweep of SVD calculation is given 
(following Paul [1995]) by: 

\\A(s) _ p(s)* i 4(0)p(s)|| 



< V2£ Fi G(n)V(2m - 4)m(m ( 27 ) 
-1) 



and an overall error for p sweeps holds: 

p* A (0) p 



A 



< p£ Fi G(n)J (2m — 4)m(m 
- l)/2, 



(28) 



where vl-resulting diagonal matrix containing 
eigenvalues of A. It is necessary to note that formulas 
(27) and (28) can be used with assumption that 
accuracy impacts of "cyclic-by-rows" ordering and 
"Brent-Luk" ordering are the same. 

Simulation was executed using C# in Visual Studio 
2010,. NET Framework 4.0 environment. Fig. 8 
illustrates the evaluation mode of OCA (Fig. 4) as an 

8 

average result of j=2 for 100 pairs of 

quaternions with random elements in range ±1 where 
yj.i denotes the jth component of a vector Y; at the 
beginning of the (f+l)th iteration. 




20 30 
Number of iterations (n) 

FIG. 8.ACCURACY OF OCA EVALUATION VS. NUMBER OF 
ITERATIONS 




3 4 5 

Sweeps number s 

FIG. 9.ACCURACY OF DIAGONALIZATION VS. NUMBER OF 
SWEEPS 

Following Bihan [2007], 1000 random Hermitian 15x15 
quaternion matrices are taken for decomposition. In 
Fig. 9, the diagonalization results are shown as an 

average value of ~ l°gioWi for n=16, 24, 32, 
40,48. 

Table 1 displays the simulation results in contrast to 
software implementation results of Jacobi SVD of 
arbitrary 15 x 15 quaternion matrices in Table 2 from 
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Bihan [2007], where RRE is the relative reconstruction 
error. Following Bihan [2007], in order to check the 
accuracy of the decomposition, we reconstruct the 
original matrix from the product (11), and calculate the 
mean relative modulus error between the elements of 
the original matrix and the reconstructed matrix. This 
is calculated as the mean over all matrix elements of I q 
- q'\ I 0.5 \q + q'\, where q is a quaternion element of 
the original matrix and q' is the corresponding element 
of the reconstructedmatrix. We also note the largest 
relative error across all elements of the matrix. 

The comparison demonstrates that the Jacobi OCA 
implementation with the proposed processor array 
architecture needs almost the same number of sweeps 
as that for software implementation. 



TABLE 1. SIMULATION RESULTS. 



Parameter 


Jacobi 
OCA 


Quaternion 
Jacobi (Bihan 
[2007]) 


Complex 
Jacobi (Bihan 
[2007]) 


Sweeps 


6.1 


6.4 


6.6 


Rotations 


658 


672 


2869 


RRE 


le-16 


1.88e-15 


2.37e-ll 



Conclusions 

In this study, a novel algorithm for VLSI 
implementation of Jacobi SVD for quaternion valued 
matrices was presented. It has a potential of speeding 
up important matrix algorithms such as calculation of 
eigenvalues and eigenvectors for Hermitian matrices. 
This algorithm can be used for triangularization of 
non-Hermitian matrices before SVD calculation as well. 
A possibility to speed up the process using TPR 
method for quaternion matrices has been indicated. 
The simple way to calculate a half angle in no evident 
form (in an implicit fashion) has been developed. A 
new organization of Jacobi SVD processor array to 
implement directly a quaternion matrix diagonalization 
without its decomposition into equivalent complex 
matrix has been described. In addition to 
diagonalization, the calculation of eigenvectors with 
the second processor array in parallel has been shown. 
Error analysis of the suggested algorithm is also given. 
The simulation results have confirmed the possibility 
to use OCA with suggested modifications for Jacobi 
SVD. 

It's crucial to mention that all improvements over the 
basic CORDIC processor (scaling iterations, redundant 
arithmetic, high-radix arithmetic, approximate 
CORDIC-based Jacobi algorithm (e.g. Gotze [1993], 
etc.) may be incorporated into the offered processor 



array as well. Processor arrays with hardware 
implementation of Jacobi OCA can be fruitful in many 
signal-processing problems requiring lots of parallel 
computations. 
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